Introduction

As the market for original content from various streaming services has grown over the years, the question arises why these companies are following this route. We want to statistically assess whether the shift toward more and more original content instead of licensed content is promising or not.

To do this analysis, we are using data scraped from various Wikipedia pages. For this purpose we used an external web scraper that scrapes all the table elements from a web page and converts them into CSV format.

The web scraping tool can be found here:

For the Netflix sub-dataset, these web pages have been scraped:

For the Amazon Prime Video sub-dataset, these web pages have been scraped:

For the Hulu sub-dataset, these web pages have been scraped:

For the Disney+ sub-dataset, these web pages have been scraped:

For the Apple TV+ sub-dataset, these web pages have been scraped:

Disclaimer

If you run into any problems (recognizable by missing Premiere values on the first page of the data frame called “netflix”), please make sure to contact us: there are various options for Unix-based operating systems, and we did not want to include them all in case they conflict with each other.

Installing required libraries

The first code chunk makes sure that all the required libraries are installed. The “forcats” library ships with the “tidyverse” but does not expose all of its functionality to the developer, so it needs to be loaded explicitly as well. To ensure correct parsing of the dates during the original-content transmutation, the locale is set to an English one.
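That chunk is not reproduced here; a minimal sketch of such a setup chunk could look as follows (the package list and the locale call are assumptions based on the description above, not the author's exact code):

```r
# Install any missing packages, then attach them (sketch only).
required <- c("tidyverse", "forcats")
missing  <- required[!(required %in% rownames(installed.packages()))]
if (length(missing) > 0) install.packages(missing)

library(tidyverse)
library(forcats)  # attached explicitly, as noted above

# Use an English locale so month names such as "July" parse correctly.
Sys.setlocale("LC_TIME", "C")
```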

Dataset Import

The upcoming chapter deals with the import of the scraped and downloaded CSVs into tibbles. Due to the large number of CSV files and their inconsistencies, the importing process was very tedious: every single CSV file needed to be inspected and merged manually. This is performed for all of the streaming services.

Custom functions for Import Phase

To exclude specific rows that do not match one of the specified statuses, formats, or origins, the only_keep() function was written. It matches part of an entry against a regular expression and parses the result into a factor, unifying the content of the column.

# Match `regex` against each entry of `row`, keep the first capture group,
# and parse it into a factor with the given levels (non-matches become NA).
only_keep <- function(row, regex, factor) {
  return(parse_factor(str_match(row, regex)[,2], levels = factor))
}

status_regex <- "(Renewed|Pending|Miniseries|Awaiting release|Event|Development|Ended|Film|Special|Distribution)"

format_regex <- "(Series|Miniseries|Film|Special)"

origin_regex <- "(Continuation|Distribution|Original)"

statuses <- c("Renewed", "Pending", "Miniseries", "Awaiting release", "Event", "Development", "Ended", "Film", "Special", "Distribution")

formats <- c("Series", "Miniseries", "Film", "Special")

origins <- c("Continuation", "Distribution", "Original")

services <- c("Netflix", "Amazon Prime Video", "Hulu", "Disney+", "Apple TV+")

ages <- c("0+", "6+", "7+", "12+", "13+", "16+", "18+", "all", "unknown")
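As a quick illustration of how these pieces fit together, only_keep() pulls the first status keyword out of a free-text cell and parses it into a factor (the raw cell below is a made-up example, and the definitions are repeated so the snippet runs on its own):

```r
library(readr)    # parse_factor()
library(stringr)  # str_match()

only_keep <- function(row, regex, factor) {
  return(parse_factor(str_match(row, regex)[,2], levels = factor))
}

status_regex <- "(Renewed|Pending|Miniseries|Awaiting release|Event|Development|Ended|Film|Special|Distribution)"
statuses <- c("Renewed", "Pending", "Miniseries", "Awaiting release", "Event",
              "Development", "Ended", "Film", "Special", "Distribution")

# A hypothetical raw cell as it might come out of the scraper:
only_keep("Renewed for a fourth season", status_regex, statuses)
# -> the factor level "Renewed"; cells without any keyword become NA
```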

Import Netflix

netflix_originals_1 <- read_csv("../datasets/netflix/shows/table-1.csv")
netflix_originals_1 <- transmute(netflix_originals_1,
                                 Title = Title,
                                 Genre = paste(Genre, " Drama"),
                                 Premiere = Premiere,
                                 Seasons = Seasons,
                                 Runtime = Runtime,
                                 Language = NA,
                                 Status = Status,
                                 Format = "Series",
                                 Origin = "Original",
                                 Network = NA,
                                 Region = NA)
## # A tibble: 2,950 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <chr>    <fct>  <fct>   <fct>  <fct> 
##  1 Stra… Scie… July 15… 3 seas… 42-78 … <NA>     Renew… Netflix Series Origi…
##  2 The … Hist… Novembe… 4 seas… 47-61 … <NA>     Renew… Netflix Series Origi…
##  3 Ozark Crim… July 21… 3 seas… 52-80 … <NA>     Renew… Netflix Series Origi…
##  4 Lost… Scie… April 1… 2 seas… 39-66 … <NA>     Renew… Netflix Series Origi…
##  5 Chil… Supe… October… 3 part… 50-64 … <NA>     Await… Netflix Series Origi…
##  6 Narc… Crim… Novembe… 2 seas… 45-70 … <NA>     Renew… Netflix Series Origi…
##  7 The … Supe… Februar… 2 seas… 40-60 … <NA>     Renew… Netflix Series Origi…
##  8 Blac… Zomb… April 1… 1 seas… 21-45 … <NA>     Renew… Netflix Series Origi…
##  9 Anot… Scie… July 25… 1 seas… 37-61 … <NA>     Renew… Netflix Series Origi…
## 10 Crim… Poli… Septemb… 2 seas… 41-47 … <NA>     Pendi… Netflix Series Origi…
## # … with 2,940 more rows, and 2 more variables: Network <chr>, Region <chr>

Import Amazon Prime Video

amazon_originals_1 <- read_csv("../datasets/amazon/shows/table-1.csv")
amazon_originals_1 <- transmute(amazon_originals_1,
                                Title = Title,
                                Genre = paste(Genre, " Drama"),
                                Premiere = Premiere,
                                Seasons = Seasons,
                                Runtime = NA,
                                Language = NA,
                                Status = Status,
                                Format = "Series",
                                Origin = "Original",
                                Network = NA,
                                Region = NA)
## # A tibble: 496 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <chr>    <fct>  <fct>   <fct>  <fct> 
##  1 Bosch Dete… Februar… 6 seas… <NA>    <NA>     Renew… Amazon… Series Origi…
##  2 Hand… Psyc… Septemb… 2 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  3 The … Alte… Novembe… 4 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  4 Mad … Dram… January… 1 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  5 Goli… Lega… October… 3 seas… <NA>    <NA>     Renew… Amazon… Series Origi…
##  6 Good… Hist… October… 1 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  7 Snea… Crim… January… 3 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  8 Z: T… Hist… January… 1 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  9 Patr… Crim… Februar… 2 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
## 10 The … Hist… July 28… 1 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
## # … with 486 more rows, and 2 more variables: Network <chr>, Region <chr>

Import Hulu

hulu_originals_1 <- read_csv("../datasets/hulu/shows/table-1.csv")
hulu_originals_1 <- transmute(hulu_originals_1,
                              Title = Title,
                              Genre = paste(Genre, " Drama"),
                              Premiere = Premiere,
                              Seasons = Seasons,
                              Runtime = Length,
                              Language = NA,
                              Status = Status,
                              Format = "Series",
                              Origin = "Original",
                              Network = NA,
                              Region = NA)
## # A tibble: 227 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <chr>    <fct>  <fct>   <fct>  <fct> 
##  1 The … Dram… May 2, … 10 epi… 5-7 mi… <NA>     Minis… Hulu    Series Origi…
##  2 East… Teen… June 3,… 4 seas… 22-24 … <NA>     Ended  Hulu    Series Origi…
##  3 11.2… Dram… Februar… 8 epis… 44-81 … <NA>     Minis… Hulu    Series Origi…
##  4 The … Dram… March 3… 3 seas… 45-56 … <NA>     Ended  Hulu    Series Origi…
##  5 Frea… Horr… October… 2 seas… 22-24 … <NA>     Ended  Hulu    Series Origi…
##  6 Chan… Crim… October… 2 seas… 40-42 … <NA>     Ended  Hulu    Series Origi…
##  7 Shut… Dram… Decembe… 2 seas… 40-42 … <NA>     Ended  Hulu    Series Origi…
##  8 Dime… Scie… April 4… 1 seas… 40 min. <NA>     Ended  Hulu    Series Origi…
##  9 The … Dyst… April 2… 3 seas… 44-64 … <NA>     Renew… Hulu    Series Origi…
## 10 Marv… Supe… Novembe… 3 seas… 46-53 … <NA>     Ended  Hulu    Series Origi…
## # … with 217 more rows, and 2 more variables: Network <chr>, Region <chr>

Import Disney+

disney_originals_1 <- read_csv("../datasets/disney/shows/table-1.csv")
disney_originals_1 <- transmute(disney_originals_1,
                                Title = Title,
                                Genre = paste(Genre, " Drama"),
                                Premiere = Premiere,
                                Seasons = Seasons,
                                Runtime = Runtime,
                                Language = NA,
                                Status = Status,
                                Format = "Series",
                                Origin = "Original",
                                Network = NA,
                                Region = NA)
## # A tibble: 194 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <lgl>    <fct>  <fct>   <fct>  <fct> 
##  1 The … Spac… Novembe… 2 seas… 31–54 … NA       Renew… Disney+ Series Origi…
##  2 The … Hist… October… 1 seas… 42–52 … NA       Pendi… Disney+ Series Origi…
##  3 Wand… Supe… January… 6 epis… TBA     NA       Minis… Disney+ Series Origi…
##  4 The … Supe… March 1… 6 epis… TBA     NA       Minis… Disney+ Series Origi…
##  5 High… Musi… Novembe… 1 seas… 26–34 … NA       Renew… Disney+ Series Origi…
##  6 Diar… Come… January… 1 seas… 22–28 … NA       Renew… Disney+ Series Origi…
##  7 Fork… Anim… Novembe… 1 seas… 3–4 mi… NA       Pendi… Disney+ Series Origi…
##  8 Spar… Anim… Novembe… 2 seas… 7–12 m… NA       <NA>   Disney+ Series Origi…
##  9 Shor… Anim… January… 1 seas… 5–7 mi… NA       Pendi… Disney+ Series Origi…
## 10 Zeni… Anim… May 22,… 1 seas… 5–49 m… NA       Pendi… Disney+ Series Origi…
## # … with 184 more rows, and 2 more variables: Network <chr>, Region <chr>

Import Apple TV+

apple_originals_1 <- read_csv("../datasets/apple/shows/table-1.csv")
apple_originals_1 <- transmute(apple_originals_1,
                               Title = Title,
                               Genre = paste(Genre, " Drama"),
                               Premiere = Premiere,
                               Seasons = Seasons,
                               Runtime = Runtime,
                               Language = NA,
                               Status = Status,
                               Format = "Series",
                               Origin = "Original",
                               Network = NA,
                               Region = NA)
## # A tibble: 121 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <lgl>    <fct>  <fct>   <fct>  <fct> 
##  1 For … Alte… Novembe… 1 seas… 48-76 … NA       Renew… Apple … Series Origi…
##  2 The … Dram… Novembe… 1 seas… 50-69 … NA       Renew… Apple … Series Origi…
##  3 See   Scie… Novembe… 1 seas… 49-57 … NA       Renew… Apple … Series Origi…
##  4 Serv… Psyc… Novembe… 1 seas… 29-36 … NA       Renew… Apple … Series Origi…
##  5 Trut… Lega… Decembe… 1 seas… 39-50 … NA       Renew… Apple … Series Origi…
##  6 Amaz… Scie… March 6… 1 seas… 50 min. NA       Pendi… Apple … Series Origi…
##  7 Home… Myst… April 3… 1 seas… 50 min. NA       Renew… Apple … Series Origi…
##  8 Defe… Crim… April 2… 8 epis… 45-65 … NA       Minis… Apple … Series Origi…
##  9 Dick… Peri… Novembe… 1 seas… 30 min. NA       Renew… Apple … Series Origi…
## 10 Ghos… Fami… Novembe… 2 seas… 30 min. NA       Pendi… Apple … Series Origi…
## # … with 111 more rows, and 2 more variables: Network <chr>, Region <chr>

Import Movies

movies <- read_csv("../datasets/movies.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   ID = col_double(),
##   Title = col_character(),
##   Year = col_double(),
##   Age = col_character(),
##   IMDb = col_double(),
##   `Rotten Tomatoes` = col_character(),
##   Netflix = col_double(),
##   Hulu = col_double(),
##   `Prime Video` = col_double(),
##   `Disney+` = col_double(),
##   Type = col_double(),
##   Directors = col_character(),
##   Genres = col_character(),
##   Country = col_character(),
##   Language = col_character(),
##   Runtime = col_double()
## )
movies <- transmute(movies,
                    Title = Title,
                    Genre = str_replace_all(Genres, ",", " / "),
                    Year = Year,
                    Age = Age,
                    Runtime = Runtime,
                    IMDb = IMDb,
                    `Rotten Tomatoes` = `Rotten Tomatoes`,
                    Netflix = Netflix > 0,
                    Hulu = Hulu > 0,
                    `Prime Video` = `Prime Video` > 0,
                    `Disney+` = `Disney+` > 0,
                    `Apple TV+` = FALSE,
                    Director = str_replace_all(Directors, ",", " / "),
                    Country = str_replace_all(Country, ",", " / "),
                    Language = str_replace_all(Language, ",", " / "),
                    Format = parse_factor("Film", levels = formats),
                    Status = parse_factor("Ended", levels = statuses))

movies
## # A tibble: 16,744 x 17
##    Title Genre  Year Age   Runtime  IMDb `Rotten Tomatoe… Netflix Hulu 
##    <chr> <chr> <dbl> <chr>   <dbl> <dbl> <chr>            <lgl>   <lgl>
##  1 Ince… Acti…  2010 13+       148   8.8 87%              TRUE    FALSE
##  2 The … Acti…  1999 18+       136   8.7 87%              TRUE    FALSE
##  3 Aven… Acti…  2018 13+       149   8.5 84%              TRUE    FALSE
##  4 Back… Adve…  1985 7+        116   8.5 96%              TRUE    FALSE
##  5 The … West…  1966 18+       161   8.8 97%              TRUE    FALSE
##  6 Spid… Anim…  2018 7+        117   8.4 97%              TRUE    FALSE
##  7 The … Biog…  2002 18+       150   8.5 95%              TRUE    FALSE
##  8 Djan… Dram…  2012 18+       165   8.4 87%              TRUE    FALSE
##  9 Raid… Acti…  1981 7+        115   8.4 95%              TRUE    FALSE
## 10 Ingl… Adve…  2009 18+       153   8.3 89%              TRUE    FALSE
## # … with 16,734 more rows, and 8 more variables: `Prime Video` <lgl>,
## #   `Disney+` <lgl>, `Apple TV+` <lgl>, Director <chr>, Country <chr>,
## #   Language <chr>, Format <fct>, Status <fct>

Import Shows

shows <- read_csv("../datasets/shows.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Title = col_character(),
##   Year = col_double(),
##   Age = col_character(),
##   IMDb = col_double(),
##   `Rotten Tomatoes` = col_character(),
##   Netflix = col_double(),
##   Hulu = col_double(),
##   `Prime Video` = col_double(),
##   `Disney+` = col_double(),
##   type = col_double()
## )
shows <- transmute(shows,
                   Title = Title,
                   Year = Year,
                   Age = Age,
                   IMDb = IMDb,
                   `Rotten Tomatoes` = `Rotten Tomatoes`,
                   Netflix = Netflix > 0,
                   Hulu = Hulu > 0,
                   `Prime Video` = `Prime Video` > 0,
                   `Disney+` = `Disney+` > 0,
                   `Apple TV+` = FALSE,
                   Format = parse_factor("Series", levels = formats),
                   Status = parse_factor("Ended", levels = statuses))

shows
## # A tibble: 5,611 x 12
##    Title  Year Age    IMDb `Rotten Tomatoe… Netflix Hulu  `Prime Video`
##    <chr> <dbl> <chr> <dbl> <chr>            <lgl>   <lgl> <lgl>        
##  1 Brea…  2008 18+     9.5 96%              TRUE    FALSE FALSE        
##  2 Stra…  2016 16+     8.8 93%              TRUE    FALSE FALSE        
##  3 Mone…  2017 18+     8.4 91%              TRUE    FALSE FALSE        
##  4 Sher…  2010 16+     9.1 78%              TRUE    FALSE FALSE        
##  5 Bett…  2015 18+     8.7 97%              TRUE    FALSE FALSE        
##  6 The …  2005 16+     8.9 81%              TRUE    FALSE FALSE        
##  7 Blac…  2011 18+     8.8 83%              TRUE    FALSE FALSE        
##  8 Supe…  2005 16+     8.4 93%              TRUE    FALSE FALSE        
##  9 Peak…  2013 18+     8.8 92%              TRUE    FALSE FALSE        
## 10 Avat…  2005 7+      9.2 100%             TRUE    FALSE FALSE        
## # … with 5,601 more rows, and 4 more variables: `Disney+` <lgl>, `Apple
## #   TV+` <lgl>, Format <fct>, Status <fct>

Dataset Transmutation

Custom transmutation functions

# A single letter, or a run of hyphens -- used to detect "word" characters.
anycharacter <- "([a-zA-Z]|-+)"

genre_to_uppercase_unique <- function(genre) {
  if (length(genre) == 0 || is.na(genre) || genre == "") {
    return(NA)
  }
  
  curr <- ""
  genre_vector <- c()
  c <- strsplit(genre, "")[[1]]
  
  for (i in 1:length(c)) {
    if (grepl(anycharacter, c[i])) {
      curr <- gsub(" ", "", str_squish(paste(curr, c[i])))
    }
    if (!grepl(anycharacter, c[i]) | i == length(c)) {
      if (i > 1 & grepl(anycharacter, c[i-1])) {
        whole_word = strsplit(curr, "")[[1]]
        single_genre <- ""
        
        for (j in 1:length(whole_word)) {
          if (j == 1) {
            single_genre <- gsub(" ", "", str_squish(toupper(whole_word[j])))
          } else {
            single_genre <- gsub(" ", "", str_squish(paste(single_genre, whole_word[j])))
          }
        }
        
        if (is.na(match(single_genre, genre_vector))) {
          genre_vector <- c(genre_vector, single_genre)
        }
      }
      curr <- ""
    }
  }
  
  all_genres <- ""
  for (i in 1:length(genre_vector)) {
    if (i != length(genre_vector)) {
      all_genres <- str_squish(paste(all_genres, genre_vector[i], "/"))
    } else {
      all_genres <- paste(all_genres, genre_vector[i])
    }
  }
  return(all_genres)
}

genres_to_uppercase_unique <- function(genres) {
  genres_vector <- c()
  for (genre in genres) {
    genres_vector <- c(genres_vector, c(genre_to_uppercase_unique(genre)))
  }
  return(genres_vector)
}
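For intuition, all the character-walking above boils down to: split the cell on non-letter separators, capitalize each word, drop duplicates, and rejoin with " / ". A compact base-R sketch of that idea (normalize_genre() is a hypothetical stand-in, not the function used in the analysis, and it skips the original's whitespace quirks):

```r
normalize_genre <- function(genre) {
  if (length(genre) == 0 || is.na(genre) || genre == "") return(NA_character_)
  words <- unlist(strsplit(genre, "[^A-Za-z-]+"))    # split on separators
  words <- words[words != ""]
  words <- paste0(toupper(substring(words, 1, 1)),   # capitalize first letter
                  substring(words, 2))
  paste(unique(words), collapse = " / ")             # dedupe and rejoin
}

normalize_genre("action comedy action")  # "Action / Comedy"
```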

Transmute Netflix

Transmutation of the given Netflix series and movie table

After the data has been imported into the correct columns in the “netflix_originals.Rmd” file, we are now going to transmute the data into the proper data types and into readable formats, as is needed for multi-value cells like Genre. For this, we needed a function that parses the value of the Genre cells and unifies their content, so that we can be sure it is in a format that can be properly plotted afterwards.

After the function has been defined, we can start converting the columns of the data frame into their respective data types, so that they are easier to deal with during the plotting process. This means that the premiere is converted into a Date object, and the Seasons and Runtime columns are converted and split up so that only integer values are left. The rest of the columns are just carried over.
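To make the regular expressions in the transmute() call easier to follow, this base-R snippet (using regexec()/regmatches() in place of stringr::str_match(), with made-up sample strings) shows what each pattern extracts:

```r
# The `group`-th capture of `pattern` in `x`, as an integer (NA if no match).
capture_int <- function(x, pattern, group = 1) {
  m <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(m) < group + 1) return(NA_integer_)
  strtoi(m[group + 1])
}

capture_int("3 seasons",   "([0-9]+) (season.*|part.*|volume.*)")  # 3
capture_int("25 episodes", "([0-9]+) (episode*)")                  # 25
capture_int("42-78 min.",  "([0-9]+)-([0-9]+)", group = 1)         # 42 -> Min_Time
capture_int("42-78 min.",  "([0-9]+)-([0-9]+)", group = 2)         # 78 -> Max_Time

# Premiere strings such as "July 15, 2016" become Dates (needs an English locale):
as.Date(gsub(" ", "-", gsub(",", "", "July 15, 2016")), "%B-%d-%Y")
```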

netflix_originals <- filter(netflix_originals,
                            !grepl("Pending|pending", Title),
                            Title != "Awaiting release",
                            !grepl("Miniseries|miniseries", Title),
                            !grepl("Renewed|renewed", Title),
                            !grepl("due to premiere", Title))

netflix_originals
## # A tibble: 2,947 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <chr>    <fct>  <fct>   <fct>  <fct> 
##  1 Stra… Scie… July 15… 3 seas… 42-78 … <NA>     Renew… Netflix Series Origi…
##  2 The … Hist… Novembe… 4 seas… 47-61 … <NA>     Renew… Netflix Series Origi…
##  3 Ozark Crim… July 21… 3 seas… 52-80 … <NA>     Renew… Netflix Series Origi…
##  4 Lost… Scie… April 1… 2 seas… 39-66 … <NA>     Renew… Netflix Series Origi…
##  5 Chil… Supe… October… 3 part… 50-64 … <NA>     Await… Netflix Series Origi…
##  6 Narc… Crim… Novembe… 2 seas… 45-70 … <NA>     Renew… Netflix Series Origi…
##  7 The … Supe… Februar… 2 seas… 40-60 … <NA>     Renew… Netflix Series Origi…
##  8 Blac… Zomb… April 1… 1 seas… 21-45 … <NA>     Renew… Netflix Series Origi…
##  9 Anot… Scie… July 25… 1 seas… 37-61 … <NA>     Renew… Netflix Series Origi…
## 10 Crim… Poli… Septemb… 2 seas… 41-47 … <NA>     Pendi… Netflix Series Origi…
## # … with 2,937 more rows, and 2 more variables: Network <chr>, Region <chr>
netflix <- transmute(netflix_originals, 
                                Title = Title,
                                Genre = genres_to_uppercase_unique(str_replace_all(
                                          str_squish(
                                            str_replace_all(
                                              str_replace_all(
                                                str_replace_all(
                                                  str_replace_all(
                                                    str_replace_all(Genre, "series|procedural", ""),
                                                  "Science fiction", "Science-Fiction"),
                                                "(C|c)oming-of-age ", ""),
                                              "(C|c)omedy-(D|d)rama", "Dramedy"),
                                            "(Docu|docu).*", "Documentary")
                                          ),
                                        "/| ", " / ")),
                                Premiere = as.Date(str_replace_all(str_replace_all(Premiere, ",", ""), " ", "-"), "%B-%d-%Y"),
                                Episodes = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]+) (episode*)")[,2]),
                                                  str_match(Seasons, "([0-9]+) (episode*)")[,2],
                                                  NA)),
                                 Seasons = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]+) (season.*|part.*|volume.*)")[,2]),
                                                   str_match(Seasons, "([0-9]+) (season.*|part.*|volume.*)")[,2],
                                                   NA)),
                                Min_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,2]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,2],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Max_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,3]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,3],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Language = Language,
                                Status = Status,
                                Service = Service,
                                Format = as.factor(Format),
                                Origin = as.factor(Origin),
                                Network = as.factor(Network),
                                Region = as.factor(Region)
                               )

netflix
## # A tibble: 2,947 x 14
##    Title Genre Premiere   Episodes Seasons Min_Time Max_Time Language Status
##    <chr> <chr> <date>        <int>   <int>    <int>    <int> <chr>    <fct> 
##  1 Stra… Scie… 2016-07-15       25       3       42       78 <NA>     Renew…
##  2 The … Hist… 2016-11-04       40       4       47       61 <NA>     Renew…
##  3 Ozark Crim… 2017-07-21       30       3       52       80 <NA>     Renew…
##  4 Lost… Scie… 2018-04-13       20       2       39       66 <NA>     Renew…
##  5 Chil… Supe… 2018-10-26       28       3       50       64 <NA>     Await…
##  6 Narc… Crim… 2018-11-16       20       2       45       70 <NA>     Renew…
##  7 The … Supe… 2019-02-15       20       2       40       60 <NA>     Renew…
##  8 Blac… Zomb… 2019-04-11        8       1       21       45 <NA>     Renew…
##  9 Anot… Scie… 2019-07-25       10       1       37       61 <NA>     Renew…
## 10 Crim… Poli… 2019-09-20        7       2       41       47 <NA>     Pendi…
## # … with 2,937 more rows, and 5 more variables: Service <fct>, Format <fct>,
## #   Origin <fct>, Network <fct>, Region <fct>
rm("netflix_originals")

Transmute Amazon Prime Video

Transmutation of the given Amazon Prime series and movie table

amazon_originals <- filter(amazon_originals,
                            !grepl("Pending|pending", Title),
                            Title != "Awaiting release",
                            !grepl("Miniseries|miniseries", Title),
                            !grepl("Renewed|renewed", Title),
                            !grepl("due to premiere", Title))

amazon_originals
## # A tibble: 495 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <chr>    <fct>  <fct>   <fct>  <fct> 
##  1 Bosch Dete… Februar… 6 seas… <NA>    <NA>     Renew… Amazon… Series Origi…
##  2 Hand… Psyc… Septemb… 2 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  3 The … Alte… Novembe… 4 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  4 Mad … Dram… January… 1 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  5 Goli… Lega… October… 3 seas… <NA>    <NA>     Renew… Amazon… Series Origi…
##  6 Good… Hist… October… 1 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  7 Snea… Crim… January… 3 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  8 Z: T… Hist… January… 1 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
##  9 Patr… Crim… Februar… 2 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
## 10 The … Hist… July 28… 1 seas… <NA>    <NA>     Ended  Amazon… Series Origi…
## # … with 485 more rows, and 2 more variables: Network <chr>, Region <chr>
amazon <- transmute(amazon_originals, 
                                Title = Title,
                                Genre = genres_to_uppercase_unique(str_replace_all(
                                          str_squish(
                                            str_replace_all(
                                              str_replace_all(
                                                str_replace_all(
                                                  str_replace_all(
                                                    str_replace_all(Genre, "series|procedural", ""),
                                                  "Science fiction", "Science-Fiction"),
                                                "(C|c)oming-of-age ", ""),
                                              "(C|c)omedy-(D|d)rama", "Dramedy"),
                                            "(Docu|docu).*", "Documentary")
                                          ),
                                        "/| ", " / ")),
                                Premiere = as.Date(str_replace_all(str_replace_all(Premiere, ",", ""), " ", "-"), "%B-%d-%Y"),
                                Episodes = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]+) (episode*)")[,2]),
                                                  str_match(Seasons, "([0-9]+) (episode*)")[,2],
                                                  NA)),
                                Seasons = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]+) (season.*|part.*|volume.*)")[,2]),
                                                  str_match(Seasons, "([0-9]+) (season.*|part.*|volume.*)")[,2],
                                                  NA)),
                                Min_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,2]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,2],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Max_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,3]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,3],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Language = Language,
                                Status = Status,
                                Service = Service,
                                Format = as.factor(Format),
                                Origin = as.factor(Origin),
                                Network = as.factor(Network),
                                Region = as.factor(Region)
                               )

amazon
## # A tibble: 495 x 14
##    Title Genre Premiere   Episodes Seasons Min_Time Max_Time Language Status
##    <chr> <chr> <date>        <int>   <int>    <int>    <int> <chr>    <fct> 
##  1 Bosch "Det… 2015-02-13       60       6       NA       NA <NA>     Renew…
##  2 Hand… "Psy… 2015-09-04       20       2       NA       NA <NA>     Ended 
##  3 The … "Alt… 2015-11-20       40       4       NA       NA <NA>     Ended 
##  4 Mad … " Dr… 2016-01-22       10       1       NA       NA <NA>     Ended 
##  5 Goli… "Leg… 2016-10-14       24       3       NA       NA <NA>     Renew…
##  6 Good… "His… 2016-10-28       10       1       NA       NA <NA>     Ended 
##  7 Snea… "Cri… 2017-01-13       30       3       NA       NA <NA>     Ended 
##  8 Z: T… "His… 2017-01-27       10       1       NA       NA <NA>     Ended 
##  9 Patr… "Cri… 2017-02-24       18       2       NA       NA <NA>     Ended 
## 10 The … "His… 2017-07-28        9       1       NA       NA <NA>     Ended 
## # … with 485 more rows, and 5 more variables: Service <fct>, Format <fct>,
## #   Origin <fct>, Network <fct>, Region <fct>
rm("amazon_originals")

Transmute Hulu

Transmutation of the given Hulu series and movie table

hulu_originals <- filter(hulu_originals,
                            !grepl("Pending|pending", Title),
                            Title != "Awaiting release",
                            !grepl("Miniseries|miniseries", Title),
                            !grepl("Renewed|renewed", Title),
                            !grepl("due to premiere", Title))

hulu_originals
## # A tibble: 224 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <chr>    <fct>  <fct>   <fct>  <fct> 
##  1 The … Dram… May 2, … 10 epi… 5-7 mi… <NA>     Minis… Hulu    Series Origi…
##  2 East… Teen… June 3,… 4 seas… 22-24 … <NA>     Ended  Hulu    Series Origi…
##  3 11.2… Dram… Februar… 8 epis… 44-81 … <NA>     Minis… Hulu    Series Origi…
##  4 The … Dram… March 3… 3 seas… 45-56 … <NA>     Ended  Hulu    Series Origi…
##  5 Frea… Horr… October… 2 seas… 22-24 … <NA>     Ended  Hulu    Series Origi…
##  6 Chan… Crim… October… 2 seas… 40-42 … <NA>     Ended  Hulu    Series Origi…
##  7 Shut… Dram… Decembe… 2 seas… 40-42 … <NA>     Ended  Hulu    Series Origi…
##  8 Dime… Scie… April 4… 1 seas… 40 min. <NA>     Ended  Hulu    Series Origi…
##  9 The … Dyst… April 2… 3 seas… 44-64 … <NA>     Renew… Hulu    Series Origi…
## 10 Marv… Supe… Novembe… 3 seas… 46-53 … <NA>     Ended  Hulu    Series Origi…
## # … with 214 more rows, and 2 more variables: Network <chr>, Region <chr>
hulu <- transmute(hulu_originals, 
                                Title = Title,
                                Genre = genres_to_uppercase_unique(str_replace_all(
                                          str_squish(
                                            str_replace_all(
                                              str_replace_all(
                                                str_replace_all(
                                                  str_replace_all(
                                                    str_replace_all(Genre, "series|procedural", ""),
                                                  "Science fiction", "Science-Fiction"),
                                                "(C|c)oming-of-age ", ""),
                                              "(C|c)omedy-(D|d)rama", "Dramedy"),
                                            "(Docu|docu).*", "Documentary")
                                          ),
                                        "/| ", " / ")),
                                Premiere = as.Date(str_replace_all(str_replace_all(Premiere, ",", ""), " ", "-"), "%B-%d-%Y"),
                                Episodes = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]+) (episode*)")[,2]),
                                                  str_match(Seasons, "([0-9]+) (episode*)")[,2],
                                                  NA)),
                                Seasons = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]) (season.*|part.*|volume.*)")[,2]),
                                                  str_match(Seasons, "([0-9]) (season.*|part.*|volume.*)")[,2],
                                                  NA)),
                                Min_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,2]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,2],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Max_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,3]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,3],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Language = Language,
                                Status = Status,
                                Service = Service,
                                Format = as.factor(Format),
                                Origin = as.factor(Origin),
                                Network = as.factor(Network),
                                Region = as.factor(Region)
                               )

hulu
## # A tibble: 224 x 14
##    Title Genre Premiere   Episodes Seasons Min_Time Max_Time Language Status
##    <chr> <chr> <date>        <int>   <int>    <int>    <int> <chr>    <fct> 
##  1 The … " Dr… 2011-05-02       10      NA        5        7 <NA>     Minis…
##  2 East… "Tee… 2013-06-03       61       4       22       24 <NA>     Ended 
##  3 11.2… " Dr… 2016-02-15        8      NA       44       81 <NA>     Minis…
##  4 The … " Dr… 2016-03-30       36       3       45       56 <NA>     Ended 
##  5 Frea… "Hor… 2016-10-10       20       2       22       24 <NA>     Ended 
##  6 Chan… "Cri… 2016-10-19       20       2       40       42 <NA>     Ended 
##  7 Shut… " Dr… 2016-12-07       20       2       40       42 <NA>     Ended 
##  8 Dime… "Sci… 2017-04-04        6       1       40       40 <NA>     Ended 
##  9 The … "Dys… 2017-04-26       36       3       44       64 <NA>     Renew…
## 10 Marv… "Sup… 2017-11-21       33       3       46       53 <NA>     Ended 
## # … with 214 more rows, and 5 more variables: Service <fct>, Format <fct>,
## #   Origin <fct>, Network <fct>, Region <fct>
rm("hulu_originals")

Transmute Disney+

Transmutation of the given Disney+ series and movie table

disney_originals <- filter(disney_originals,
                            !grepl("Pending|pending", Title),
                            Title != "Awaiting release",
                            !grepl("Miniseries|miniseries", Title),
                            !grepl("Renewed|renewed", Title),
                            !grepl("due to premiere", Title))

disney_originals
## # A tibble: 193 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <lgl>    <fct>  <fct>   <fct>  <fct> 
##  1 The … Spac… Novembe… 2 seas… 31–54 … NA       Renew… Disney+ Series Origi…
##  2 The … Hist… October… 1 seas… 42–52 … NA       Pendi… Disney+ Series Origi…
##  3 Wand… Supe… January… 6 epis… TBA     NA       Minis… Disney+ Series Origi…
##  4 The … Supe… March 1… 6 epis… TBA     NA       Minis… Disney+ Series Origi…
##  5 High… Musi… Novembe… 1 seas… 26–34 … NA       Renew… Disney+ Series Origi…
##  6 Diar… Come… January… 1 seas… 22–28 … NA       Renew… Disney+ Series Origi…
##  7 Fork… Anim… Novembe… 1 seas… 3–4 mi… NA       Pendi… Disney+ Series Origi…
##  8 Spar… Anim… Novembe… 2 seas… 7–12 m… NA       <NA>   Disney+ Series Origi…
##  9 Shor… Anim… January… 1 seas… 5–7 mi… NA       Pendi… Disney+ Series Origi…
## 10 Zeni… Anim… May 22,… 1 seas… 5–49 m… NA       Pendi… Disney+ Series Origi…
## # … with 183 more rows, and 2 more variables: Network <chr>, Region <chr>
disney <- transmute(disney_originals, 
                                Title = Title,
                                Genre = genres_to_uppercase_unique(str_replace_all(
                                          str_squish(
                                            str_replace_all(
                                              str_replace_all(
                                                str_replace_all(
                                                  str_replace_all(
                                                    str_replace_all(Genre, "series|procedural", ""),
                                                  "Science fiction", "Science-Fiction"),
                                                "(C|c)oming-of-age ", ""),
                                              "(C|c)omedy-(D|d)rama", "Dramedy"),
                                            "(Docu|docu).*", "Documentary")
                                          ),
                                        "/| ", " / ")),
                                Premiere = as.Date(str_replace_all(str_replace_all(Premiere, ",", ""), " ", "-"), "%B-%d-%Y"),
                                Episodes = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]+) (episode*)")[,2]),
                                                  str_match(Seasons, "([0-9]+) (episode*)")[,2],
                                                  NA)),
                                Seasons = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]) (season.*|part.*|volume.*)")[,2]),
                                                  str_match(Seasons, "([0-9]) (season.*|part.*|volume.*)")[,2],
                                                  NA)),
                                Min_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,2]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,2],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Max_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,3]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,3],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Language = Language,
                                Status = Status,
                                Service = Service,
                                Format = as.factor(Format),
                                Origin = as.factor(Origin),
                                Network = as.factor(Network),
                                Region = as.factor(Region)
                               )

disney
## # A tibble: 193 x 14
##    Title Genre Premiere   Episodes Seasons Min_Time Max_Time Language Status
##    <chr> <chr> <date>        <int>   <int>    <int>    <int> <lgl>    <fct> 
##  1 The … Spac… 2019-11-12       16       2       31       31 NA       Renew…
##  2 The … Hist… 2020-10-09        8       1       42       42 NA       Pendi…
##  3 Wand… Supe… 2021-01-15        6      NA       NA       NA NA       Minis…
##  4 The … Supe… 2021-03-19        6      NA       NA       NA NA       Minis…
##  5 High… Musi… 2019-11-12       10       1       26       26 NA       Renew…
##  6 Diar… Dram… 2020-01-17       10       1       22       22 NA       Renew…
##  7 Fork… Anim… 2019-11-12       10       1        3        3 NA       Pendi…
##  8 Spar… Anim… 2019-11-12        7       2        7        7 NA       <NA>  
##  9 Shor… Anim… 2020-01-24       14       1        5        5 NA       Pendi…
## 10 Zeni… Anim… 2020-05-22       11       1        5        5 NA       Pendi…
## # … with 183 more rows, and 5 more variables: Service <fct>, Format <fct>,
## #   Origin <fct>, Network <fct>, Region <fct>
rm("disney_originals")
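Note that the Disney+ runtime strings use an en dash (e.g. “31–54 min.”), while the extraction pattern above matches only a plain hyphen, which is why Min_Time and Max_Time come out equal for ranged runtimes in the output. A pattern accepting both dash characters recovers the upper bound; a minimal sketch on a hypothetical input string, assuming stringr is loaded:

```r
library(stringr)

runtime <- "31–54 min."  # en dash between the bounds, as in the Disney+ table

# Match either a hyphen-minus or an en dash between the two numbers
bounds <- str_match(runtime, "([0-9]+)[-–]([0-9]+)")
min_time <- strtoi(bounds[, 2])
max_time <- strtoi(bounds[, 3])
```

Substituting `[-–]` for `-` in the Min_Time/Max_Time expressions would apply the same fix inside the transmutation.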

Transmute Apple TV+

Transmutation of the given Apple TV+ series and movie table

apple_originals <- filter(apple_originals,
                          !grepl("Pending|pending", Title),
                          Title != "Awaiting release",
                          !grepl("Miniseries|miniseries", Title),
                          !grepl("Renewed|renewed", Title),
                          !grepl("due to premiere", Title))

apple_originals
## # A tibble: 120 x 12
##    Title Genre Premiere Seasons Runtime Language Status Service Format Origin
##    <chr> <chr> <chr>    <chr>   <chr>   <lgl>    <fct>  <fct>   <fct>  <fct> 
##  1 For … Alte… Novembe… 1 seas… 48-76 … NA       Renew… Apple … Series Origi…
##  2 The … Dram… Novembe… 1 seas… 50-69 … NA       Renew… Apple … Series Origi…
##  3 See   Scie… Novembe… 1 seas… 49-57 … NA       Renew… Apple … Series Origi…
##  4 Serv… Psyc… Novembe… 1 seas… 29-36 … NA       Renew… Apple … Series Origi…
##  5 Trut… Lega… Decembe… 1 seas… 39-50 … NA       Renew… Apple … Series Origi…
##  6 Amaz… Scie… March 6… 1 seas… 50 min. NA       Pendi… Apple … Series Origi…
##  7 Home… Myst… April 3… 1 seas… 50 min. NA       Renew… Apple … Series Origi…
##  8 Defe… Crim… April 2… 8 epis… 45-65 … NA       Minis… Apple … Series Origi…
##  9 Dick… Peri… Novembe… 1 seas… 30 min. NA       Renew… Apple … Series Origi…
## 10 Ghos… Fami… Novembe… 2 seas… 30 min. NA       Pendi… Apple … Series Origi…
## # … with 110 more rows, and 2 more variables: Network <chr>, Region <chr>
apple <- transmute(apple_originals, 
                   Title = Title,
                                Genre = genres_to_uppercase_unique(str_replace_all(
                                          str_squish(
                                            str_replace_all(
                                              str_replace_all(
                                                str_replace_all(
                                                  str_replace_all(
                                                    str_replace_all(Genre, "series|procedural", ""),
                                                  "Science fiction", "Science-Fiction"),
                                                "(C|c)oming-of-age ", ""),
                                              "(C|c)omedy-(D|d)rama", "Dramedy"),
                                            "(Docu|docu).*", "Documentary")
                                          ),
                                        "/| ", " / ")),
                                Premiere = as.Date(str_replace_all(str_replace_all(Premiere, ",", ""), " ", "-"), "%B-%d-%Y"),
                                Episodes = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]+) (episode*)")[,2]),
                                                  str_match(Seasons, "([0-9]+) (episode*)")[,2],
                                                  NA)),
                                Seasons = strtoi(ifelse(!is.na(str_match(Seasons, "([0-9]) (season.*|part.*|volume.*)")[,2]),
                                                  str_match(Seasons, "([0-9]) (season.*|part.*|volume.*)")[,2],
                                                  NA)),
                                Min_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,2]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,2],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Max_Time = strtoi(ifelse(!is.na(str_match(Runtime, "([0-9]+)-([0-9]+)")[,3]),
                                                  str_match(Runtime, "([0-9]+)-([0-9]+)")[,3],
                                                  str_match(Runtime, "([0-9]+)")[,2])),
                                Language = Language,
                                Status = Status,
                                Service = Service,
                                Format = as.factor(Format),
                                Origin = as.factor(Origin),
                                Network = as.factor(Network),
                                Region = as.factor(Region)
                               )

apple
## # A tibble: 120 x 14
##    Title Genre Premiere   Episodes Seasons Min_Time Max_Time Language Status
##    <chr> <chr> <date>        <int>   <int>    <int>    <int> <lgl>    <fct> 
##  1 For … "Alt… 2019-11-01       10       1       48       76 NA       Renew…
##  2 The … " Dr… 2019-11-01       10       1       50       69 NA       Renew…
##  3 See   "Sci… 2019-11-01        8       1       49       57 NA       Renew…
##  4 Serv… "Psy… 2019-11-28       10       1       29       36 NA       Renew…
##  5 Trut… "Leg… 2019-12-06        8       1       39       50 NA       Renew…
##  6 Amaz… "Sci… 2020-03-06        5       1       50       50 NA       Pendi…
##  7 Home… "Mys… 2020-04-03       10       1       50       50 NA       Renew…
##  8 Defe… "Cri… 2020-04-24        8      NA       45       65 NA       Minis…
##  9 Dick… "Per… 2019-11-01       10       1       30       30 NA       Renew…
## 10 Ghos… "Fam… 2019-11-01       20       2       30       30 NA       Pendi…
## # … with 110 more rows, and 5 more variables: Service <fct>, Format <fct>,
## #   Origin <fct>, Network <fct>, Region <fct>
rm("apple_originals")

Combine Services

The goal now is to combine the results of the transmutations into one large data frame that can then be explored to gain a general grasp of what original content looks like across five of the largest streaming services.

original_content <- rbind(netflix, amazon)
original_content <- rbind(original_content, hulu)
original_content <- rbind(original_content, disney)
(original_content <- rbind(original_content, apple))
## # A tibble: 3,979 x 14
##    Title Genre Premiere   Episodes Seasons Min_Time Max_Time Language Status
##    <chr> <chr> <date>        <int>   <int>    <int>    <int> <chr>    <fct> 
##  1 Stra… Scie… 2016-07-15       25       3       42       78 <NA>     Renew…
##  2 The … Hist… 2016-11-04       40       4       47       61 <NA>     Renew…
##  3 Ozark Crim… 2017-07-21       30       3       52       80 <NA>     Renew…
##  4 Lost… Scie… 2018-04-13       20       2       39       66 <NA>     Renew…
##  5 Chil… Supe… 2018-10-26       28       3       50       64 <NA>     Await…
##  6 Narc… Crim… 2018-11-16       20       2       45       70 <NA>     Renew…
##  7 The … Supe… 2019-02-15       20       2       40       60 <NA>     Renew…
##  8 Blac… Zomb… 2019-04-11        8       1       21       45 <NA>     Renew…
##  9 Anot… Scie… 2019-07-25       10       1       37       61 <NA>     Renew…
## 10 Crim… Poli… 2019-09-20        7       2       41       47 <NA>     Pendi…
## # … with 3,969 more rows, and 5 more variables: Service <fct>, Format <fct>,
## #   Origin <fct>, Network <fct>, Region <fct>

Join with Movies

natural_joined <- full_join(movies, original_content, by = "Title")

natural_joined <- mutate(natural_joined, `Genre.x` = ifelse(is.na(`Genre.x`), `Genre.y`, `Genre.x`))
natural_joined <- mutate(natural_joined, `Language.x` = ifelse(is.na(`Language.x`), `Language.y`, `Language.x`))
natural_joined <- mutate(natural_joined, `Format.x` = ifelse(is.na(`Format.x`), `Format.y`, `Format.x`))
natural_joined <- mutate(natural_joined, `Status.x` = ifelse(is.na(`Status.x`), `Status.y`, `Status.x`))
natural_joined <- mutate(natural_joined, Country = ifelse(is.na(Country), Region, Country))
natural_joined <- mutate(natural_joined, Min_Time = as.integer(ifelse(is.na(Min_Time), Runtime, Min_Time)))
natural_joined <- mutate(natural_joined, Max_Time = as.integer(ifelse(is.na(Max_Time), Runtime, Max_Time)))

natural_joined <- natural_joined[, !(names(natural_joined) %in% c("Genre.y", "Language.y", "Region", "Format.y", "Status.y", "Runtime"))]

(natural_joined <- rename(natural_joined, Genre = `Genre.x`, Language = `Language.x`, Format = `Format.x`, Status = `Status.x`))
## # A tibble: 19,885 x 24
##    Title Genre  Year Age    IMDb `Rotten Tomatoe… Netflix Hulu  `Prime Video`
##    <chr> <chr> <dbl> <chr> <dbl> <chr>            <lgl>   <lgl> <lgl>        
##  1 Ince… Acti…  2010 13+     8.8 87%              TRUE    FALSE FALSE        
##  2 The … Acti…  1999 18+     8.7 87%              TRUE    FALSE FALSE        
##  3 Aven… Acti…  2018 13+     8.5 84%              TRUE    FALSE FALSE        
##  4 Back… Adve…  1985 7+      8.5 96%              TRUE    FALSE FALSE        
##  5 The … West…  1966 18+     8.8 97%              TRUE    FALSE TRUE         
##  6 Spid… Anim…  2018 7+      8.4 97%              TRUE    FALSE FALSE        
##  7 The … Biog…  2002 18+     8.5 95%              TRUE    FALSE TRUE         
##  8 Djan… Dram…  2012 18+     8.4 87%              TRUE    FALSE FALSE        
##  9 Raid… Acti…  1981 7+      8.4 95%              TRUE    FALSE FALSE        
## 10 Ingl… Adve…  2009 18+     8.3 89%              TRUE    FALSE FALSE        
## # … with 19,875 more rows, and 15 more variables: `Disney+` <lgl>, `Apple
## #   TV+` <lgl>, Director <chr>, Country <chr>, Language <chr>, Format <int>,
## #   Status <int>, Premiere <date>, Episodes <int>, Seasons <int>,
## #   Min_Time <int>, Max_Time <int>, Service <fct>, Origin <fct>, Network <fct>
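The repeated `ifelse(is.na(x.x), x.y, x.x)` pattern used above to merge the duplicated join columns is what `dplyr::coalesce()` expresses directly; a minimal sketch with toy frames (hypothetical data, not the actual movie tables):

```r
library(dplyr)

a <- tibble(Title = c("A", "B"), Genre = c(NA, "Drama"))
b <- tibble(Title = c("A", "B"), Genre = c("Comedy", "Thriller"))

# full_join keeps both Genre columns; coalesce takes the first non-NA value per row
joined <- full_join(a, b, by = "Title", suffix = c(".x", ".y")) %>%
  mutate(Genre = coalesce(Genre.x, Genre.y)) %>%
  select(Title, Genre)
```

One `coalesce()` call per column could replace each `mutate(... ifelse(is.na(...), ...))` pair in the joins above without changing the result.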

Join with Shows

content <- full_join(shows, natural_joined, by = "Title")

content <- mutate(content, `Year.x` = as.integer(ifelse(is.na(`Year.x`), `Year.y`, `Year.x`)))
content <- mutate(content, `Age.x` = ifelse(is.na(`Age.x`), `Age.y`, `Age.x`))
content <- mutate(content, `IMDb.x` = ifelse(is.na(`IMDb.x`), `IMDb.y`, `IMDb.x`))
content <- mutate(content, `Rotten Tomatoes.x` = ifelse(is.na(`Rotten Tomatoes.x`), `Rotten Tomatoes.y`, `Rotten Tomatoes.x`))
content <- mutate(content, `Netflix.x` = ifelse(is.na(`Netflix.x`), `Netflix.y`, `Netflix.x`))
content <- mutate(content, `Hulu.x` = ifelse(is.na(`Hulu.x`), `Hulu.y`, `Hulu.x`))
content <- mutate(content, `Prime Video.x` = ifelse(is.na(`Prime Video.x`), `Prime Video.y`, `Prime Video.x`))
content <- mutate(content, `Disney+.x` = ifelse(is.na(`Disney+.x`), `Disney+.y`, `Disney+.x`))
content <- mutate(content, `Apple TV+.x` = ifelse(is.na(`Apple TV+.x`), `Apple TV+.y`, `Apple TV+.x`))
content <- mutate(content, `Format.x` = as.integer(ifelse(is.na(`Format.x`), `Format.y`, `Format.x`)))
content <- mutate(content, `Status.x` = as.integer(ifelse(is.na(`Status.x`), `Status.y`, `Status.x`)))

content <- content[, !(names(content) %in% c("Year.y", "Age.y", "IMDb.y", "Rotten Tomatoes.y", "Netflix.y", "Hulu.y", "Prime Video.y", "Disney+.y", "Apple TV+.y", "Format.y", "Status.y"))]

content <- rename(content, Year = `Year.x`, Age = `Age.x`, IMDb = `IMDb.x`, `Rotten Tomatoes` = `Rotten Tomatoes.x`, Netflix = `Netflix.x`, Hulu = `Hulu.x`, `Prime Video` = `Prime Video.x`, `Disney+` = `Disney+.x`, `Apple TV+` = `Apple TV+.x`, Format = `Format.x`, Status = `Status.x`)

content <- mutate(content, Month = as.integer(ifelse(is.na(Premiere), NA, as.numeric(format(Premiere, "%m")))))
content <- mutate(content, Day = as.integer(ifelse(is.na(Premiere), NA, as.numeric(format(Premiere, "%d")))))

content <- mutate(content, Age = parse_factor(ifelse(is.na(Age), "unknown", Age), levels = ages))
content <- mutate(content, Origin = parse_factor(origins[ifelse(is.na(Origin), parse_factor("Distribution", levels = origins), Origin)], levels = origins))

content <- mutate(content, Format = parse_factor(formats[Format], levels = formats))
content <- mutate(content, Status = parse_factor(statuses[Status], levels = statuses))

content <- mutate(content, `Rotten Tomatoes` = as.integer(str_replace(`Rotten Tomatoes`, "%", "")))

content <- mutate(content, Year = ifelse(is.na(Year), ifelse(is.na(Premiere), NA, lubridate::year(Premiere)), Year))

(content <- content[, !(names(content) %in% c("Premiere"))])
## # A tibble: 24,532 x 25
##    Title  Year Age    IMDb `Rotten Tomatoe… Netflix Hulu  `Prime Video`
##    <chr> <dbl> <fct> <dbl>            <int> <lgl>   <lgl> <lgl>        
##  1 Brea…  2008 18+     9.5               96 TRUE    FALSE FALSE        
##  2 Stra…  2016 16+     8.8               93 TRUE    FALSE FALSE        
##  3 Mone…  2017 18+     8.4               91 TRUE    FALSE FALSE        
##  4 Sher…  2010 16+     9.1               78 TRUE    FALSE FALSE        
##  5 Bett…  2015 18+     8.7               97 TRUE    FALSE FALSE        
##  6 The …  2005 16+     8.9               81 TRUE    FALSE FALSE        
##  7 Blac…  2011 18+     8.8               83 TRUE    FALSE FALSE        
##  8 Supe…  2005 16+     8.4               93 TRUE    FALSE FALSE        
##  9 Peak…  2013 18+     8.8               92 TRUE    FALSE FALSE        
## 10 Avat…  2005 7+      9.2              100 TRUE    FALSE FALSE        
## # … with 24,522 more rows, and 17 more variables: `Disney+` <lgl>, `Apple
## #   TV+` <lgl>, Format <fct>, Status <fct>, Genre <chr>, Director <chr>,
## #   Country <chr>, Language <chr>, Episodes <int>, Seasons <int>,
## #   Min_Time <int>, Max_Time <int>, Service <fct>, Origin <fct>, Network <fct>,
## #   Month <int>, Day <int>

Add missing Apple TV+ original age ratings

apple_ages <- read_csv("../datasets/apple_ages.csv")
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Age = col_character()
## )
apple_ages <- mutate(apple_ages, Age = parse_factor(Age, levels = ages))

content <- full_join(content, apple_ages, by = "Title")
content <- mutate(content, `Age.x` = parse_factor(ages[ifelse(is.na(`Age.x`) | `Age.x` == "unknown", parse_factor(ifelse(is.na(`Age.y`), "unknown", `Age.y`)), `Age.x`)], levels = ages))
content <- content[, !(names(content) %in% c("Age.y"))]
content <- rename(content, Age = `Age.x`)

rm("apple_ages")

content <- mutate(content, Age = parse_factor(ages[ifelse(Age == "7+", parse_factor("6+", levels = ages), Age)], levels = ages))
content <- mutate(content, Age = parse_factor(ages[ifelse(Age == "13+", parse_factor("12+", levels = ages), Age)], levels = ages))
(content <- mutate(content, Age = parse_factor(ages[ifelse(Age == "all", parse_factor("0+", levels = ages), Age)], levels = ages)))
## # A tibble: 24,536 x 25
##    Title  Year Age    IMDb `Rotten Tomatoe… Netflix Hulu  `Prime Video`
##    <chr> <dbl> <fct> <dbl>            <int> <lgl>   <lgl> <lgl>        
##  1 Brea…  2008 18+     9.5               96 TRUE    FALSE FALSE        
##  2 Stra…  2016 16+     8.8               93 TRUE    FALSE FALSE        
##  3 Mone…  2017 18+     8.4               91 TRUE    FALSE FALSE        
##  4 Sher…  2010 16+     9.1               78 TRUE    FALSE FALSE        
##  5 Bett…  2015 18+     8.7               97 TRUE    FALSE FALSE        
##  6 The …  2005 16+     8.9               81 TRUE    FALSE FALSE        
##  7 Blac…  2011 18+     8.8               83 TRUE    FALSE FALSE        
##  8 Supe…  2005 16+     8.4               93 TRUE    FALSE FALSE        
##  9 Peak…  2013 18+     8.8               92 TRUE    FALSE FALSE        
## 10 Avat…  2005 6+      9.2              100 TRUE    FALSE FALSE        
## # … with 24,526 more rows, and 17 more variables: `Disney+` <lgl>, `Apple
## #   TV+` <lgl>, Format <fct>, Status <fct>, Genre <chr>, Director <chr>,
## #   Country <chr>, Language <chr>, Episodes <int>, Seasons <int>,
## #   Min_Time <int>, Max_Time <int>, Service <fct>, Origin <fct>, Network <fct>,
## #   Month <int>, Day <int>

Dataset Documentation

unique(select(content, Origin))
## # A tibble: 4 x 1
##   Origin      
##   <fct>       
## 1 Distribution
## 2 Original    
## 3 Continuation
## 4 <NA>
filter(content, !is.na(Service))
## # A tibble: 3,999 x 25
##    Title  Year Age    IMDb `Rotten Tomatoe… Netflix Hulu  `Prime Video`
##    <chr> <dbl> <fct> <dbl>            <int> <lgl>   <lgl> <lgl>        
##  1 Stra…  2016 16+     8.8               93 TRUE    FALSE FALSE        
##  2 Bett…  2015 18+     8.7               97 TRUE    FALSE FALSE        
##  3 Peak…  2013 18+     8.8               92 TRUE    FALSE FALSE        
##  4 Dark   2017 16+     8.7               94 TRUE    FALSE FALSE        
##  5 Ozark  2017 18+     8.4               81 TRUE    FALSE FALSE        
##  6 Narc…  2015 18+     8.8               89 TRUE    FALSE FALSE        
##  7 Mind…  2017 18+     8.6               96 TRUE    FALSE FALSE        
##  8 The …  2019 18+     8.3               67 TRUE    FALSE FALSE        
##  9 Outl…  2014 18+     8.4               91 TRUE    FALSE FALSE        
## 10 Outl…  2014 18+     8.4               91 TRUE    FALSE FALSE        
## # … with 3,989 more rows, and 17 more variables: `Disney+` <lgl>, `Apple
## #   TV+` <lgl>, Format <fct>, Status <fct>, Genre <chr>, Director <chr>,
## #   Country <chr>, Language <chr>, Episodes <int>, Seasons <int>,
## #   Min_Time <int>, Max_Time <int>, Service <fct>, Origin <fct>, Network <fct>,
## #   Month <int>, Day <int>

Now that the data set is fully combined, we document each column to describe what it represents.

  • Title
    • Data type: chr
    • Description: The title of the series or movie (sometimes followed by a note listing specific seasons, because the streaming service only acquired the rights for those seasons)
  • Year
    • Data type: dbl
    • Description: The year in which the series or movie originally premiered
  • Age
    • Data type: fctr
    • Description: The age restriction of the series or movie (for unification purposes, 7+ and 13+ have been converted to 6+ and 12+)
  • IMDb
    • Data type: dbl
    • Description: The IMDb (Internet Movie Database) rating, on a scale from 1 to 10
  • Rotten Tomatoes
    • Data type: int
    • Description: The Rotten Tomatoes score of the series or movie, as a percentage from 0 to 100
  • Netflix
    • Data type: lgl
    • Description: Indicator for whether the series or movie is available for streaming on Netflix
  • Hulu
    • Data type: lgl
    • Description: Indicator for whether the series or movie is available for streaming on Hulu
  • Prime Video
    • Data type: lgl
    • Description: Indicator for whether the series or movie is available for streaming on Amazon Prime Video
  • Disney+
    • Data type: lgl
    • Description: Indicator for whether the series or movie is available for streaming on Disney+
  • Apple TV+
    • Data type: lgl
    • Description: Indicator for whether the series or movie is available for streaming on Apple TV+
  • Format
    • Data type: fctr
    • Description: Format of the production, can be one of the following types: Series, Film, Miniseries, Special
  • Status
    • Data type: fctr
    • Description: Status of the production, can be one of the following types: Ended, Renewed, Pending, Miniseries, Event, Development, Distribution, Special, NA
  • Genre
    • Data type: chr
    • Description: List of genres of the series or movie (separated by " / ")
  • Director
    • Data type: chr
    • Description: List of the names of directors of the series or movie (separated by " / ")
  • Country
    • Data type: chr
    • Description: List of the countries of the series or movie (separated by " / ")
  • Language
    • Data type: chr
    • Description: List of the languages the series or movie was originally released in (separated by " / ")
  • Episodes
    • Data type: int
    • Description: Number of episodes
  • Seasons
    • Data type: int
    • Description: Number of seasons
  • Min_Time
    • Data type: int
    • Description: The shortest runtime found (min and max runtime are equal for movies or productions with only one episode)
  • Max_Time
    • Data type: int
    • Description: The longest runtime found (min and max runtime are equal for movies or productions with only one episode)
  • Service
    • Data type: fctr
    • Description: The streaming service that originally produced the content (NA if the production is not original content), can be one of the following types: Netflix, Amazon Prime Video, Hulu, Disney+, Apple TV+
  • Origin
    • Data type: fctr
    • Description: The origin of the production, can be one of the following types: Original, Distribution, Continuation, NA (no valid info found)
  • Network
    • Data type: fctr
    • Description: The network that produced or co-produced the series or movie, usually set for co-productions and distributions
  • Month
    • Data type: int
    • Description: The month in which the series or movie originally premiered
  • Day
    • Data type: int
    • Description: The day on which the series or movie originally premiered
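The documented types can be spot-checked in code; the sketch below uses a toy tibble standing in for a few columns of `content` (illustrative values and an illustrative level set for Age, not the real data):

```r
library(tibble)

age_levels <- c("0+", "6+", "12+", "16+", "18+", "unknown")  # illustrative

toy <- tibble(
  Title = "Example Show",                       # chr
  Year = 2016,                                  # dbl
  Age = factor("16+", levels = age_levels),     # fctr
  IMDb = 8.8,                                   # dbl
  `Rotten Tomatoes` = 93L,                      # int
  Netflix = TRUE                                # lgl
)

# Assert each column matches its documented data type
stopifnot(
  is.character(toy$Title),
  is.numeric(toy$Year),
  is.factor(toy$Age),
  is.double(toy$IMDb),
  is.integer(toy$`Rotten Tomatoes`),
  is.logical(toy$Netflix)
)
```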

Add other missing Apple TV+ values

content <- mutate(content, Netflix = ifelse(is.na(Netflix), FALSE, Netflix))
content <- mutate(content, Hulu = ifelse(is.na(Hulu), FALSE, Hulu))
content <- mutate(content, `Prime Video` = ifelse(is.na(`Prime Video`), FALSE, `Prime Video`))
content <- mutate(content, `Disney+` = ifelse(is.na(`Disney+`), FALSE, `Disney+`))
(content <- mutate(content, `Apple TV+` = ifelse(Service == "Apple TV+", TRUE, `Apple TV+`)))
## # A tibble: 24,536 x 25
##    Title  Year Age    IMDb `Rotten Tomatoe… Netflix Hulu  `Prime Video`
##    <chr> <dbl> <fct> <dbl>            <int> <lgl>   <lgl> <lgl>        
##  1 Brea…  2008 18+     9.5               96 TRUE    FALSE FALSE        
##  2 Stra…  2016 16+     8.8               93 TRUE    FALSE FALSE        
##  3 Mone…  2017 18+     8.4               91 TRUE    FALSE FALSE        
##  4 Sher…  2010 16+     9.1               78 TRUE    FALSE FALSE        
##  5 Bett…  2015 18+     8.7               97 TRUE    FALSE FALSE        
##  6 The …  2005 16+     8.9               81 TRUE    FALSE FALSE        
##  7 Blac…  2011 18+     8.8               83 TRUE    FALSE FALSE        
##  8 Supe…  2005 16+     8.4               93 TRUE    FALSE FALSE        
##  9 Peak…  2013 18+     8.8               92 TRUE    FALSE FALSE        
## 10 Avat…  2005 6+      9.2              100 TRUE    FALSE FALSE        
## # … with 24,526 more rows, and 17 more variables: `Disney+` <lgl>, `Apple
## #   TV+` <lgl>, Format <fct>, Status <fct>, Genre <chr>, Director <chr>,
## #   Country <chr>, Language <chr>, Episodes <int>, Seasons <int>,
## #   Min_Time <int>, Max_Time <int>, Service <fct>, Origin <fct>, Network <fct>,
## #   Month <int>, Day <int>

Data Exploration

Production of original content throughout the years

To see how the production of original content has evolved, we look at how many productions premiered in each year. The assumption we want to check is that the streaming services produce more original content over time, since they want to avoid paying licensing fees for third-party movies and series.

To make the following plots easier to read, we picked colors from the streaming services’ logos, so that readers can identify each service at a glance.

# Adding a column to sort by, will be used as a categorical variable later on
prem_year <- filter(mutate(original_content, Premiere_Year = lubridate::year(Premiere)), Premiere_Year != 2021, !is.na(Premiere_Year))

original_colors <- c(rgb(229/255, 9/255, 20/255, 1),
                     rgb(0/255, 163/255, 218/255, 1),
                     rgb(28/255, 231/255, 131/255, 1),
                     rgb(148/255, 31/255, 138/255, 1),
                     rgb(0, 0, 0, 1))

ggplot(data = prem_year, mapping = aes(x = Premiere_Year, fill = as.factor(Service))) +
  geom_bar() +
  theme(title = element_text(size = 14), axis.title = element_text(size = 12), legend.title = element_text(size = 12), legend.text = element_text(size = 10)) +
  labs(fill = "Streaming Service", title = "Release Years", x = "Release Year", y = "Number of movies and series") +
  scale_fill_manual(values = original_colors)

Linear regression model for the amount of original content until 2050

years <- c(2005:2020)

productions <- function(years, dataframe) {
  productions <- c()
  for (year in years) {
    productions <- c(productions, nrow(filter(select(dataframe, Premiere_Year), Premiere_Year == year)))
  }
  return(productions)
}

linear_regression <- lm(years ~ productions(years, prem_year), data = filter(prem_year, is.na(match(Premiere_Year, years))))
summary(linear_regression)
## 
## Call:
## lm(formula = years ~ productions(years, prem_year), data = filter(prem_year, 
##     is.na(match(Premiere_Year, years))))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1336 -2.1853  0.2464  2.2166  3.8083 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   2.010e+03  8.509e-01 2362.446  < 2e-16 ***
## productions(years, prem_year) 1.533e-02  2.939e-03    5.217  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.872 on 14 degrees of freedom
## Multiple R-squared:  0.6603, Adjusted R-squared:  0.6361 
## F-statistic: 27.22 on 1 and 14 DF,  p-value: 0.0001304
plot(linear_regression)
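The model above regresses the year on the production count. To actually project production counts out to 2050, as the section heading suggests, one can fit the reverse model, counts on year, and extrapolate with `predict()`. A sketch on made-up yearly counts (NOT the scraped data; with the real data, `counts` would come from `productions(years, prem_year)`):

```r
# Sketch with synthetic yearly counts: regress counts on year and extrapolate.
years  <- 2005:2020
counts <- c(1, 2, 4, 6, 9, 13, 20, 30, 45, 60, 90, 130, 190, 260, 340, 420)  # made up
fit <- lm(counts ~ years)
predict(fit, newdata = data.frame(years = 2050))  # projected count for 2050
```

A linear fit is a strong assumption for a 30-year extrapolation; it only illustrates the mechanics of `predict()` on new data.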

Change in Netflix content origin over the years

netflix_origin_change <- content %>% 
  filter(Year >= 2007, Netflix == TRUE) %>% 
  group_by(Year, Origin) %>% 
  summarise(n = n()) %>% 
  mutate(Percentage = n / sum(n))
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
ggplot(netflix_origin_change, mapping = aes(x = Year, y = Percentage, fill = factor(Origin))) +
  geom_area()

Change in Amazon Prime Video content origin over the years

amazon_origin_change <- content %>% 
  filter(Year >= 2006, `Prime Video` == TRUE) %>% 
  group_by(Year, Origin) %>% 
  summarise(n = n()) %>% 
  mutate(Percentage = n / sum(n))
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
ggplot(amazon_origin_change, mapping = aes(x = Year, y = Percentage, fill = factor(Origin))) +
  geom_area()

Change in Hulu content origin over the years

hulu_origin_change <- content %>% 
  filter(Year >= 2007, `Hulu` == TRUE) %>% 
  group_by(Year, Origin) %>% 
  summarise(n = n()) %>% 
  mutate(Percentage = n / sum(n))
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
ggplot(hulu_origin_change, mapping = aes(x = Year, y = Percentage, fill = factor(Origin))) +
  geom_area()

Change in Disney+ content origin over the years

disney_origin_change <- content %>% 
  filter(Year >= 2019, `Disney+` == TRUE) %>% 
  group_by(Year, Origin) %>% 
  summarise(n = n()) %>% 
  mutate(Percentage = n / sum(n))
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
ggplot(disney_origin_change, mapping = aes(x = Year, y = Percentage, fill = factor(Origin))) +
  geom_area()


Change in Apple TV+ content origin over the years

apple_origin_change <- content %>% 
  filter(Year >= 2019, `Apple TV+` == TRUE) %>% 
  group_by(Year, Origin) %>% 
  summarise(n = n()) %>% 
  mutate(Percentage = n / sum(n))
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
ggplot(apple_origin_change, mapping = aes(x = Year, y = Percentage, fill = factor(Origin))) +
  geom_area()
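The five origin-change chunks above differ only in the service column and the first year; the shared computation can be factored into one helper. A base-R sketch (the column names `Year` and `Origin` match the data frame above; the `geom_area()` plotting step stays unchanged):

```r
# Hypothetical helper: per-year share of each Origin for one service flag column.
origin_share <- function(df, service_col, start_year) {
  sub <- df[df$Year >= start_year & df[[service_col]] == TRUE, c("Year", "Origin")]
  counts <- as.data.frame(table(Year = sub$Year, Origin = sub$Origin))
  # Divide each Year/Origin count by the total count of that year
  counts$Percentage <- counts$Freq / ave(counts$Freq, counts$Year, FUN = sum)
  counts
}
# e.g. origin_share(content, "Netflix", 2007) instead of the hand-written pipeline
</imports>
```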

Production of original content grouped by months

There is a common perception that most of the interesting movies and series are released during the autumn and winter months, so we checked whether this claim also holds true for original content.

prem_month <- filter(mutate(original_content,
                            Premiere_Month = as.character(lubridate::month(Premiere, label = TRUE, abbr = TRUE))),
                     lubridate::year(Premiere) != 2021, !is.na(lubridate::year(Premiere)))

ggplot(data = prem_month, mapping = aes(x = reorder(Premiere_Month, lubridate::month(Premiere)),fill = as.factor(Service))) +
  geom_bar(width = 0.85) +
  theme(title = element_text(size = 14), axis.title = element_text(size = 12), legend.title = element_text(size = 12), legend.text = element_text(size = 10)) +
  labs(fill = "Streaming Service", title = "Release Months", x = "Release Month", y = "Number of movies and series") +
  scale_fill_manual(values = original_colors)

We also had the impression that the share of series relative to movies has increased steadily, so we looked at how this ratio evolved over the years for the various streaming services. We started with two separate graphs showing the number of movies and the number of series per service, but as the plots show, these are of limited use on their own, since every service tends to produce more original content over time regardless of format. This led to the approach shown in the last of the three graphs: for years in which a service released neither movies nor series, we use 50% as a neutral baseline, and otherwise we plot the share of series among all productions. For Netflix, for example, a recognizable trend emerges from the data gathered so far.

series_and_movies <- as_tibble(filter(prem_year %>%
  group_by(Service, Premiere_Year = strtoi(Premiere_Year), Format) %>%
  summarise(Entries = length(Service)), Format == "Film" | Format == "Series"))
## `summarise()` regrouping output by 'Service', 'Premiere_Year' (override with `.groups` argument)
series_and_movies
## # A tibble: 54 x 4
##    Service Premiere_Year Format Entries
##    <fct>           <int> <fct>    <int>
##  1 Netflix          2005 Film         1
##  2 Netflix          2012 Series       1
##  3 Netflix          2012 Film         2
##  4 Netflix          2013 Series       6
##  5 Netflix          2013 Film         8
##  6 Netflix          2014 Series       7
##  7 Netflix          2014 Film        12
##  8 Netflix          2015 Series      28
##  9 Netflix          2015 Film        24
## 10 Netflix          2016 Series      57
## # … with 44 more rows
percentage_of_series <- function(dataframe) {
  series_percentages <- data.frame(Service = c(), Premiere_Year = c(), Series_Percentage = c())
  years <- c(2000:2020)
  for (service in services) {
    service_rows <- filter(dataframe, Service == service)
    for (year in years) {
      nr_series <- sum(pull(filter(service_rows, Format == "Series", Premiere_Year == year), Entries))
      nr_films <- sum(pull(filter(service_rows, Format == "Film", Premiere_Year == year), Entries))
      series_percentages <- rbind(series_percentages,
                                  data.frame(Service = as.factor(c(service)), Premiere_Year = c(year), Percentage = c(ifelse(nr_series == 0,
                                    ifelse(nr_films == 0, 50, 0), ifelse(nr_films == 0, 100, nr_series / (nr_series + nr_films) * 100)))))
    }
  }
  return(series_percentages)
}
percentage_of_series(series_and_movies)
##                Service Premiere_Year Percentage
## 1              Netflix          2000   50.00000
## 2              Netflix          2001   50.00000
## 3              Netflix          2002   50.00000
## 4              Netflix          2003   50.00000
## 5              Netflix          2004   50.00000
## 6              Netflix          2005    0.00000
## 7              Netflix          2006   50.00000
## 8              Netflix          2007   50.00000
## 9              Netflix          2008   50.00000
## 10             Netflix          2009   50.00000
## 11             Netflix          2010   50.00000
## 12             Netflix          2011   50.00000
## 13             Netflix          2012   33.33333
## 14             Netflix          2013   42.85714
## 15             Netflix          2014   36.84211
## 16             Netflix          2015   53.84615
## 17             Netflix          2016   41.00719
## 18             Netflix          2017   31.96721
## 19             Netflix          2018   36.56174
## 20             Netflix          2019   43.47826
## 21             Netflix          2020   47.28370
## 22  Amazon Prime Video          2000   50.00000
## 23  Amazon Prime Video          2001   50.00000
## 24  Amazon Prime Video          2002   50.00000
## 25  Amazon Prime Video          2003   50.00000
## 26  Amazon Prime Video          2004   50.00000
## 27  Amazon Prime Video          2005   50.00000
## 28  Amazon Prime Video          2006   50.00000
## 29  Amazon Prime Video          2007   50.00000
## 30  Amazon Prime Video          2008   50.00000
## 31  Amazon Prime Video          2009   50.00000
## 32  Amazon Prime Video          2010   50.00000
## 33  Amazon Prime Video          2011   50.00000
## 34  Amazon Prime Video          2012   50.00000
## 35  Amazon Prime Video          2013  100.00000
## 36  Amazon Prime Video          2014  100.00000
## 37  Amazon Prime Video          2015  100.00000
## 38  Amazon Prime Video          2016  100.00000
## 39  Amazon Prime Video          2017   97.43590
## 40  Amazon Prime Video          2018   93.47826
## 41  Amazon Prime Video          2019   85.36585
## 42  Amazon Prime Video          2020   58.09524
## 43                Hulu          2000   50.00000
## 44                Hulu          2001   50.00000
## 45                Hulu          2002   50.00000
## 46                Hulu          2003   50.00000
## 47                Hulu          2004   50.00000
## 48                Hulu          2005   50.00000
## 49                Hulu          2006   50.00000
## 50                Hulu          2007   50.00000
## 51                Hulu          2008   50.00000
## 52                Hulu          2009   50.00000
## 53                Hulu          2010  100.00000
## 54                Hulu          2011  100.00000
## 55                Hulu          2012  100.00000
## 56                Hulu          2013  100.00000
## 57                Hulu          2014  100.00000
## 58                Hulu          2015  100.00000
## 59                Hulu          2016  100.00000
## 60                Hulu          2017   68.75000
## 61                Hulu          2018   70.58824
## 62                Hulu          2019   72.00000
## 63                Hulu          2020   71.79487
## 64             Disney+          2000   50.00000
## 65             Disney+          2001   50.00000
## 66             Disney+          2002   50.00000
## 67             Disney+          2003   50.00000
## 68             Disney+          2004   50.00000
## 69             Disney+          2005   50.00000
## 70             Disney+          2006   50.00000
## 71             Disney+          2007   50.00000
## 72             Disney+          2008   50.00000
## 73             Disney+          2009   50.00000
## 74             Disney+          2010   50.00000
## 75             Disney+          2011   50.00000
## 76             Disney+          2012   50.00000
## 77             Disney+          2013   50.00000
## 78             Disney+          2014   50.00000
## 79             Disney+          2015   50.00000
## 80             Disney+          2016   50.00000
## 81             Disney+          2017   50.00000
## 82             Disney+          2018   50.00000
## 83             Disney+          2019   68.42105
## 84             Disney+          2020   49.18033
## 85           Apple TV+          2000   50.00000
## 86           Apple TV+          2001   50.00000
## 87           Apple TV+          2002   50.00000
## 88           Apple TV+          2003   50.00000
## 89           Apple TV+          2004   50.00000
## 90           Apple TV+          2005   50.00000
## 91           Apple TV+          2006   50.00000
## 92           Apple TV+          2007   50.00000
## 93           Apple TV+          2008   50.00000
## 94           Apple TV+          2009   50.00000
## 95           Apple TV+          2010   50.00000
## 96           Apple TV+          2011   50.00000
## 97           Apple TV+          2012   50.00000
## 98           Apple TV+          2013   50.00000
## 99           Apple TV+          2014   50.00000
## 100          Apple TV+          2015   50.00000
## 101          Apple TV+          2016   50.00000
## 102          Apple TV+          2017   50.00000
## 103          Apple TV+          2018   50.00000
## 104          Apple TV+          2019   83.33333
## 105          Apple TV+          2020   72.72727
ggplot() +
  geom_line(data = filter(series_and_movies, Format == "Film"), mapping = aes(x = Premiere_Year, y = Entries, colour = as.factor(Service))) +
  labs(colour = "Streaming Services", title = "Number of released movies by streaming service", x = "Year", y = "Number of movies") +
  scale_colour_manual(values = original_colors)

ggplot() +
  geom_line(data = filter(series_and_movies, Format == "Series"), mapping = aes(x = Premiere_Year, y = Entries, colour = as.factor(Service))) +
  labs(colour = "Streaming Services", title = "Number of released series by streaming service", x = "Year", y = "Number of series") +
  scale_colour_manual(values = original_colors)

ggplot() +
  geom_line(data = percentage_of_series(series_and_movies), mapping = aes(x = Premiere_Year, y = Percentage, colour = as.factor(Service))) +
  theme(title = element_text(size = 14), axis.title = element_text(size = 12), legend.title = element_text(size = 12), legend.text = element_text(size = 10)) +
  labs(colour = "Streaming Services", title = "% of series compared to movies", x = "Release Year", y = "% of series") +
  scale_color_manual(values = original_colors)
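percentage_of_series() builds its result row by row in nested loops; the same table can be computed in one pass with xtabs(). A base-R sketch on a toy table (the real input would be series_and_movies, and the 50% no-data baseline is kept):

```r
# Vectorized alternative: cross-tabulate Entries, then compute the series share.
toy_counts <- data.frame(Service = c("A", "A", "B"),
                         Premiere_Year = c(2019, 2019, 2020),
                         Format = c("Series", "Film", "Series"),
                         Entries = c(3, 1, 2))
tab <- xtabs(Entries ~ Service + Premiere_Year + Format, data = toy_counts)
series <- tab[, , "Series"]
films  <- tab[, , "Film"]
# 50 is the neutral baseline for Service/Year cells with no productions at all
pct <- ifelse(series + films == 0, 50, 100 * series / (series + films))
pct
```

The Service-by-Year matrix `pct` can then be reshaped with `as.data.frame.table()` for plotting.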

Genre counter

genre_string <- paste(original_content$Genre, collapse = "/")
genre_vector <- strsplit(genre_string, "/")[[1]]
genre_vector_clean <- gsub(" ", "", genre_vector)

genre_frame <- transmute(data.frame(table(genre_vector_clean)),
                         Genre = genre_vector_clean,
                         Count = Freq)

top_x_genres <- function(dataframe, x) {
  # Sort by descending count and keep the first x rows
  arranged_df <- arrange(dataframe, desc(Count))
  head(data.frame(Genre = arranged_df$Genre, Entries = arranged_df$Count), x)
}

genre_count <- as_tibble(top_x_genres(genre_frame, 20) %>% summarise(Genre, Entries = strtoi(Entries)))
ggplot(data = genre_count) +
  geom_bar(mapping = aes(x = reorder(Genre, desc(Entries)), y = Entries, fill = Entries), stat = "identity", show.legend = FALSE) +
  theme(title = element_text(size = 14), axis.title.x = element_text(size = 12), axis.text.y = element_text(size = 11)) +
  labs(title = "Top 20 Genres - Original Content", x = "", y = "No. of original content productions") +
  coord_flip()

Age rating for all content

ggplot(data = content) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 1121 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Format == "Series")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 603 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Format == "Film")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 498 rows containing non-finite values (stat_bin).

Age rating for all Netflix content

ggplot(data = filter(content, Netflix == TRUE)) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

ggplot(data = filter(content, Netflix == TRUE, Format == "Series")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

ggplot(data = filter(content, Netflix == TRUE, Format == "Film")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

Age rating for all original Netflix content

ggplot(data = filter(content, Service == "Netflix", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 725 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Netflix", Format == "Series", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 301 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Netflix", Format == "Film", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 416 rows containing non-finite values (stat_bin).

Age rating for all Amazon Prime Video content

ggplot(data = filter(content, `Prime Video` == TRUE)) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

ggplot(data = filter(content, `Prime Video` == TRUE, Format == "Series")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

ggplot(data = filter(content, `Prime Video` == TRUE, Format == "Film")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

Age rating for all original Amazon Prime Video content

ggplot(data = filter(content, Service == "Amazon Prime Video", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 120 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Amazon Prime Video", Format == "Series", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 110 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Amazon Prime Video", Format == "Film", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 10 rows containing non-finite values (stat_bin).

Age rating for all Hulu content

ggplot(data = filter(content, Hulu == TRUE)) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

ggplot(data = filter(content, Hulu == TRUE, Format == "Series")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

ggplot(data = filter(content, Hulu == TRUE, Format == "Film")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

Age rating for all original Hulu content

ggplot(data = filter(content, Service == "Hulu", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 60 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Hulu", Format == "Series", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 52 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Hulu", Format == "Film", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 8 rows containing non-finite values (stat_bin).

Age rating for all Disney+ content

ggplot(data = filter(content, `Disney+` == TRUE)) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

ggplot(data = filter(content, `Disney+` == TRUE, Format == "Series")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

ggplot(data = filter(content, `Disney+` == TRUE, Format == "Film")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")

Age rating for all original Disney+ content

ggplot(data = filter(content, Service == "Disney+", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 89 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Disney+", Format == "Series", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 55 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Disney+", Format == "Film", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 30 rows containing non-finite values (stat_bin).

Age rating for all Apple TV+ content

ggplot(data = filter(content, `Apple TV+` == TRUE)) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 66 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, `Apple TV+` == TRUE, Format == "Series")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 52 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, `Apple TV+` == TRUE, Format == "Film")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 10 rows containing non-finite values (stat_bin).

Age rating for all original Apple TV+ content

ggplot(data = filter(content, Service == "Apple TV+", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 66 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Apple TV+", Format == "Series", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 52 rows containing non-finite values (stat_bin).

ggplot(data = filter(content, Service == "Apple TV+", Format == "Film", Origin == "Original")) +
  geom_histogram(mapping = aes(x = Year, fill = factor(Age)), binwidth = 1) +
  labs(fill = "Age")
## Warning: Removed 10 rows containing non-finite values (stat_bin).
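The age-rating chunks above all repeat the same filter-then-histogram pattern per service, format, and origin. A hedged base-R helper for the shared filtering step (column names taken from the data frame used above; the ggplot call itself stays unchanged):

```r
# Hypothetical helper: one filtering routine for all the age-rating histograms.
# service_col is a logical flag column such as "Netflix" or "Hulu"; fmt and
# origin are optional Format/Origin values.
age_subset <- function(df, service_col = NULL, fmt = NULL, origin = NULL) {
  keep <- rep(TRUE, nrow(df))
  if (!is.null(service_col)) keep <- keep & df[[service_col]] == TRUE
  if (!is.null(fmt))         keep <- keep & df$Format == fmt
  if (!is.null(origin))      keep <- keep & df$Origin == origin
  df[keep, , drop = FALSE]
}
# e.g. age_subset(content, "Netflix", fmt = "Series") instead of filter(content, ...)
```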

Exporting the data frame to CSV

To make the data reusable for others, we export our final data frame for future use. We are also planning to publish this data set on Kaggle.

write.csv(original_content, "../original_content.csv", row.names = TRUE)
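A quick way to sanity-check an export like the one above is to read the CSV back and compare dimensions. A self-contained sketch with a toy frame and tempfile() (not the actual `original_content` path):

```r
# Round-trip sketch: write a toy frame with row names, read it back, and check
# that rows and columns survive the trip.
toy_export <- data.frame(Title = c("A", "B"), Year = c(2019, 2020))
path <- tempfile(fileext = ".csv")
write.csv(toy_export, path, row.names = TRUE)
back <- read.csv(path, row.names = 1)  # first column holds the row names
identical(dim(back), dim(toy_export))
```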